Abstract

The computation of good image descriptors is key to the instance retrieval problem and has been the 
object of much recent interest from the multimedia research community. With deep learning becoming the 
dominant approach in computer vision, the use of representations extracted from Convolutional Neural 
Nets (CNNs) is quickly gaining ground on Fisher Vectors (FVs) as favoured state-of-the-art global image 
descriptors for image instance retrieval. While the good performance of CNNs for image classification
is unambiguously recognised, which of the two has the upper hand in the image retrieval context is not
entirely clear yet.

In this work, we propose a comprehensive study that systematically evaluates FVs and CNNs for 
image retrieval. The first part compares the performances of FVs and CNNs on multiple publicly available 
data sets. We investigate a number of details specific to each method. For FVs, we compare sparse 
descriptors based on interest point detectors with dense single-scale and multi-scale variants. For CNNs, 
we focus on understanding the impact of depth, architecture and training data on retrieval results. Our 
study shows that no descriptor is systematically better than the other and that performance gains can 
usually be obtained by using both types together. The second part of the study focuses on the impact of 
geometrical transformations such as rotations and scale changes. FVs based on interest point detectors 
are intrinsically resilient to such transformations while CNNs do not have a built-in mechanism to ensure 
such invariance. We show that the performance of CNNs can quickly degrade in the presence of rotations while
they are far less affected by changes in scale. We then propose a number of ways to incorporate the 
required invariances in the CNN pipeline. 

Overall, our work is intended as a reference guide offering practically useful and easily
implementable guidelines to anyone looking for state-of-the-art global descriptors best suited to their specific
image instance retrieval problem.

* V. Chandrasekhar, J. Lin and O. Morere contributed equally to this work. 
Index Terms

convolutional neural networks, Fisher vectors, image instance retrieval. 

I. Introduction 

Image instance retrieval is the discovery of images from a database representing the same object or scene 
as the one depicted in a query image. State-of-the-art image instance retrieval pipelines consist of two
major blocks: first, a subset of images similar to the query is retrieved from the database; next, geometric
consistency checks are applied to select the relevant images from the subset with high precision. The first
step is based on the comparison of global image descriptors: high-dimensional vectors with up to tens of
thousands of dimensions representing the image contents. Better global descriptors are key to improving
retrieval performance and have been the object of much recent interest from the multimedia research
community with work on specific applications such as digital documents JH, mobile visual search (TJ, 
Ell , distributed large scale search 0 and compact descriptors for fast real-world applications 0, 0. 

A popular global descriptor which achieves high performance is the Fisher Vector (FV) 0. The FV is 
obtained by quantizing the set of local feature descriptors with a small codebook of 64-512 centroids, and 
aggregating first and second order residual statistics for features quantized to each centroid. The residual 
statistics from each centroid are concatenated together to obtain the high-dimensional global descriptor 
representation, typically 8192 to 65536 dimensions. The performance increases as the dimensionality of 
the global descriptor increases, as shown in 0. FVs can be aggregated on descriptors extracted densely 
in the image 0, or around interest points like Difference-of-Gaussian (DoG) interest points 0. The 
former is popular for image classification tasks, while the latter is used in image retrieval as the DoG 
interest points provide invariance to scale and rotation. 

As opposed to the carefully hand-crafted FVs, deep learning has achieved remarkable performance for
large scale image classification [9], [10]. Deep learning has also achieved state-of-the-art results in many
other visual tasks such as face recognition [11], [12], pedestrian detection [13] and pose estimation [14].
In their recent work, Babenko et al. [15] propose using representations extracted from Convolutional
Neural Nets (CNNs) as a global descriptor for image retrieval, and show promising initial results for the
approach. In our own work [16], we show how stacked Restricted Boltzmann Machines (RBMs) and supervised
fine-tuning can be used to generate extremely compact hashes from global descriptors obtained from
CNNs for large scale image retrieval.
TABLE I

Summary of Experimental Results and Key Findings

| Questions | Observations and Recommendations |
|---|---|
| Best practices for CNN descriptors | |
| Best single crop strategy? | The largest possible center crop (discarding parts of the image but preserving aspect ratio) or the entire image (preserving the entire image but ignoring aspect ratio) work comparably, both outperforming padding (preserving both). |
| Best performing layer? | The first fully connected layer is a good all-round choice on all the tested models. |
| Do deeper networks help? | Only if the training and test data are similar. Otherwise, extra depth can hurt performance. |
| How much does training data matter? | Training data has significant impact on performance. Results also suggest that deeper layers are more domain specific. |
| Best practices for FV interest points | |
| Dense or sparse interest points? | It depends on the dataset. If scale and rotation invariance are not required, and the data are highly textured, dense sampling outperforms DoG interest points. |
| Single-scale or multi-scale interest points? | Multi-scale interest points always improve performance. |
| CNN versus FV | |
| How do state-of-the-art CNN and FV results compare on standard benchmarks? | It depends on the characteristics of the data set. |
| Does combining FV and CNN improve performance? | Yes, combining FV with state-of-the-art CNN descriptors can improve retrieval performance, often by a significant margin. |
| Invariance to rotations | |
| How invariant are CNN features to rotation? | CNN features exhibit very limited invariance to rotation; performance drops rapidly as query rotation angle is varied. |
| Are CNNs or FVs more invariant to rotation? | FVs based on DoG interest points are robust to rotation changes, as would be expected. CNN descriptors are more robust to rotation changes than FVs based on dense sampling. |
| How do we gain rotation invariance for CNN features? | Max-pooling across rotated versions of database images works well, at the loss of some discriminativeness when query and database images are aligned. However, the same max-pooling approach is not effective on dense FVs. |
| Are deeper CNN layers more invariant to rotation? | The fully connected layers exhibit similar invariance properties to rotation. Visual features (pool5) are slightly more robust to small rotation angles but significantly less robust to larger angles. |
| Invariance to scale changes | |
| How scale-invariant are CNN features? | CNN descriptors are robust to scale change and work well even for small query scales. |
| Are CNNs or FVs more scale-invariant? | CNN descriptors are more robust to scale changes than any FV. All FV variants experience a much sharper drop in performance as query scale is decreased compared to CNN features. |
| How do we gain scale invariance for CNN features? | Similar to rotation invariance, max-pooling across scaled versions of database images works well for gaining scale invariance, at the cost of some discriminativeness. |
| Are deeper CNN layers more scale-invariant? | Visual features (pool5) are more scale-invariant than the deeper fully connected layers. |
While deep learning has unquestionably become the dominant approach for image classification, the
case for image retrieval has yet to be clearly settled. The two types of descriptors being radically different
in nature, one can expect them to behave very differently based on specific aspects of the problem. On
the one hand, CNNs seem to obtain good retrieval results with a more compact starting representation, but
many factors related to the network architecture or the training may come into play. On the other hand,
FVs may be more robust to training data and more invariant to certain geometrical transformations of the 
images. In fact, some of the best reported instance retrieval performances are still based on hand-crafted 
features such as FVs El- 

In this work, we perform a thorough investigation of approaches based on FVs and CNNs on multiple 
publicly available datasets and analyse the pros and cons of each. The first part of the study determines
best practices for FVs and CNNs on details specific to each approach. For FVs, we investigate the
effects of sparse SIFT based on interest point detectors versus dense SIFT (single-scale and multi-scale).
For CNN descriptors, we specifically study the impact of image cropping strategies, the layer extracted from
the CNN, network depth, and training data. Next, we investigate how each type of descriptor performs
compared to the other and whether a combination of both types of descriptors can improve results over the
best FVs and CNN descriptors. The final part of our work is dedicated to the impact of geometrical
transformations such as rotations and scale changes. Unlike FVs based on interest point detectors, CNNs 
do not have a built-in mechanism to ensure resilience to such transformations. Hence it is necessary to 
understand how much CNN descriptors are affected by them. We also propose a number of ways to 
incorporate transformation invariance in the CNN pipeline. 

Our work provides a set of straightforward practical guidelines, some valid in general and some 
problem dependent, one should follow to get the global image descriptors best suited to their specific 
image instance retrieval task. 


II. Related Work 

There has been extensive work on the FV and its variants since it was first proposed for instance 
retrieval. Several improvements to the baseline FV [6] have been proposed in recent literature, including
the Residual Enhanced Visual Vector [18] and the Rate-adaptive Compact Fisher Codes (RCFC) [19].
Recent improvements also include better aggregation schemes GUI , and better matching kernels ifTTI . 
State-of-the-art results using FVs are based on aggregating statistics around interest points like Difference- 
of-Gaussian |[S] or Hessian-affine interest points GTIl . 

CNNs are now considered to be the mainstream approach for large-scale image classification. ImageNet
2014 submissions are all based on CNNs. After the winning submission of Krizhevsky et al. in the
ImageNet 2012 challenge [9], CNNs began to be applied to the instance retrieval problem as well. There
is comparatively less work on CNN-based descriptors for instance retrieval than for large-scale image
classification. Razavian et al. [22] evaluate the performance of the CNN model of [9] on a wide range of tasks
including instance retrieval, and show initial promising results. Babenko et al. [15] show that fine-tuning
a pre-trained CNN with domain specific data (objects, scenes, etc.) can improve retrieval performance
on relevant data sets. The authors also show that the CNN representations can be compressed more
effectively than their Fisher counterparts for large-scale instance retrieval. In [16], we show how sparse
high-dimensional CNN representations can be hashed to very compact representations (64-1024 bits) for
large scale image retrieval with little loss in matching performance.

While the papers above show initial results, the CNN architecture and features from [9] are used as a
black box for the retrieval task. There is no systematic study of how the CNN architecture and training
data affect retrieval performance. Also, unlike interest points which provide scale and rotation invariance 
to the FV pipeline, CNN representations used in image-classification are obtained by densely sampling 
a resized canonical image. CNN features do not provide explicit rotation and scale invariance, which are 
often key to instance retrieval tasks. Desired levels of scale and rotation invariance for CNN features can 
nevertheless be indirectly achieved from the max-pooling operations in the pipeline, the diversity of the 
training data which typically contains objects at varying scales and orientations, and data augmentation 
during the training phase where data can be preprocessed and input to the CNN at different scales and 
orientations. 


Fig. 1. FV and CNN based pipelines for the computation of global image descriptors.
In this work, we provide a systematic and thorough evaluation of FV and CNN pipelines (see Figure 1)
for instance retrieval. We run extensive experiments on 4 popular data sets: Holidays [23], UKBench [24],
Oxford buildings [25] and Stanford Mobile Visual Search [26] to study how well CNN-based approaches
generalize compared with FVs. Our CNN experiments in this work are based on publicly available
CNN models in Caffe [27] and can be fully reproduced, unlike CNN models trained by Google, Baidu,
Microsoft and Yandex in [28], [29], [15], [30].

III. Contributions 

The main contributions of our work are summarized as follows: 

• We provide a comprehensive and systematic evaluation of FVs and CNN descriptors for instance
retrieval. Our results, based on many standard, publicly available datasets and various pre-trained
state-of-the-art models for image classification, are fully reproducible.

• We identify the best practices for the use of each type of descriptor through a set of dedicated
experiments. For CNNs, we investigate the impact of the image cropping strategy, the network
depth, the layer selected as descriptor, and the training data. For FVs, we study how densely
sampled SIFT single-scale and multi-scale descriptors compare against sparse interest point
detectors.

• We compare the best performing FVs and CNN descriptors from our study to various reported 
state-of-the-art results on the various datasets. We also investigate if a mixture of FVs and CNN 
descriptors is able to further improve results. 

• Unlike FVs based on interest point detectors, CNNs do not have a built-in mechanism to ensure 
robustness to transformations such as rotations or scale changes. We therefore conduct a set of 
experiments to compare the performance and robustness of the two types of descriptors when affected 
by rotations and scale changes. We also propose a number of ways the descriptors could be made 
more invariant to those transformations. 

• The key findings from our study are summarized in Table I, intended as a quick reference guide for
practical guidelines on the use of FVs and CNNs for image retrieval. The guidelines are sometimes
general but often dependent on specific characteristics of the problem which have been properly
identified in this study.
IV. Evaluation Framework

A. Data Sets 

We evaluate the performances of the descriptors against four popular data sets: Holidays, Oxford 
buildings (Oxbuild), UKBench and Graphics. The four datasets are chosen for the diversity of data 
they provide: UKBench and Graphics are object-centric featuring close-up shots of objects in indoor 
environments. Holidays and Oxbuild are scene-centric datasets consisting primarily of outdoor buildings 
and scenes. 

INRIA Holidays. The INRIA Holidays dataset [23] consists of personal holiday pictures. The dataset
includes a large variety of outdoor scene types: natural, man-made, water and fire effects. There are 500
queries and 991 database images. Variations in lighting conditions are rare in this data set as the pictures
from the same location are taken at the same time.

Oxford Buildings. The Oxford Buildings Dataset [25] consists of 5062 images collected from Flickr
representing landmark buildings in Oxford. The collection has been manually annotated to generate a 
comprehensive ground truth for 11 different landmarks, each represented by 5 possible queries. Note that 
the set contains 55 queries only. 

UKBench. The University of Kentucky (UKY) data set [24] consists of 2550 groups of common
objects. There are 4 images representing each. Only the object of interest is present in each image. Thus,
there is no foreground or background clutter within this data set. All 10200 images are used as queries.

Graphics. The Graphics data set is part of the Stanford Mobile Visual Search data set [26], which
notably was used in the MPEG standard: Compact Descriptors for Visual Search (CDVS) [31]. The data
set contains different categories of objects like CDs, DVDs, books, software products, business cards, etc.
For product categories (CDs, DVDs and books), at least one of the references is a clean version of the
product obtained from the product website. The query images include foreground and background clutter
that would be considered typical in real-world scenarios, e.g., a picture of a CD might contain other CDs
in the background. This data set differs from the others as it contains images of rigid objects
captured under widely varying lighting conditions, perspective distortion, foreground and background
clutter. Query images are taken with heterogeneous phone cameras. Each query has two relevant images.
There are 500 unique objects, 1500 queries, and 1000 database images.

B. Fisher Vectors 

FVs are a concatenation of first and second order statistics of a set of feature descriptors quantized
with a small codebook. We resize all images (maintaining aspect ratio) so that the larger dimension of
the image is equal to 640 pixels prior to FV extraction. We use the implementation of FVs from the open
source library VLFeat [32]. SIFT detectors and descriptors are also chosen from the same library. The
three different types of SIFT descriptors used to generate the FVs are Difference of Gaussians (DoG)
SIFT, Dense Single-scale SIFT and Dense Multi-scale SIFT.

• DoG SIFT. We detect interest points in the DoG scale space, followed by 128-dimensional SIFT
descriptors extracted from scaled and oriented patches centered on interest points. Default peak and
edge thresholds (0 and 10) are employed to filter out low contrast patches or patches close to the
edge of the image. Since the DoG detector extracts scale and rotation invariant interest points, it
has been widely applied for the task of instance retrieval. It is important to note that we do not use
any feature selection algorithm to select a subset of “good” features - an approach that can result
in a significant improvement in performance on the Graphics data set [33].

• Dense Single-scale SIFT. We extract SIFT descriptors from densely sampled patches (every 4 pixels)
with fixed scale and upright orientation. The patch size used for the extraction is m × s, where s
is the scale parameter and m is the magnification parameter. We choose the default magnification
parameter m = 6 across all dense SIFT descriptors; s = 4 is chosen for single-scale SIFT. Dense
SIFT is faster to compute than DoG SIFT as the expensive interest point detection step is avoided
- however, this comes at the cost of lower scale and rotation invariance. Note that dense SIFT is
mostly popular for image classification tasks.

• Dense Multi-Scale SIFT. We apply dense SIFT extraction at multiple resolutions (s = {4, 8, 12, 16}).
This is aimed at gaining some degree of scale invariance.

Closely following [34], [6], we apply dimensionality reduction on SIFT descriptors from 128 to 64 using
PCA, and train a Gaussian Mixture Model (GMM) with 256 centroids. Both first order (gradients w.r.t.
mean) and second order (gradients w.r.t. variance) statistics are encoded to form the FV, resulting in a
64 × 256 × 2 = 32768-dimensional vector representation for each image. Finally, we apply power law
normalization to each component (α = 0.5), followed by L2 normalization to obtain the final normalized
FV representation [6]. Each dimension of the FV is stored as a floating point number. No compression
is applied. We refer to the three FVs as FVDoG (FV computed on DoG points), FVDS (FV computed
densely at a single scale) and FVDM (FV computed densely at multiple scales) from here on.
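The encoding described above can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the VLFeat implementation used in the experiments: the function name `fisher_vector`, its argument names, and the assumption of a diagonal-covariance GMM whose parameters are already trained are ours. It uses the commonly cited improved-FV gradient formulas with respect to means and variances, followed by power-law (α = 0.5) and L2 normalization.

```python
import numpy as np

def fisher_vector(x, weights, means, variances, alpha=0.5):
    """Encode local descriptors x (N x D) with a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D) are assumed to be
    already trained. Returns a vector of length 2 * K * D.
    """
    n = x.shape[0]
    sigma = np.sqrt(variances)
    # Normalized residual of every descriptor w.r.t. every component: (N, K, D)
    diff = (x[:, None, :] - means[None, :, :]) / sigma[None, :, :]
    # Log of (mixing weight * Gaussian density) for each descriptor/component
    log_p = (-0.5 * np.sum(diff ** 2, axis=2)
             - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
             + np.log(weights)[None, :])
    # Soft assignments (posteriors), computed in a numerically stable way
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # First order (mean) and second order (variance) statistics
    g_mu = np.einsum("nk,nkd->kd", gamma, diff)
    g_mu /= n * np.sqrt(weights)[:, None]
    g_var = np.einsum("nk,nkd->kd", gamma, diff ** 2 - 1.0)
    g_var /= n * np.sqrt(2.0 * weights)[:, None]
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    # Signed power-law normalization followed by L2 normalization
    fv = np.sign(fv) * np.abs(fv) ** alpha
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

With D = 64 PCA-reduced SIFT dimensions and K = 256 centroids, the returned vector has 64 × 256 × 2 = 32768 entries, matching the dimensionality stated above. For large N, the (N, K, D) intermediate array would need to be processed in batches.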

C. Convolutional Neural Net features 

In this work, we consider four different pre-trained CNN models for the instance retrieval problem: 

• OxfordNet [35]: the best performing single network from the Oxford VGG team at ImageNet 2014.

• AlexNet: the model referenced as “BVLC reference caffenet” in the Caffe framework [27]. This
network closely mimics the original AlexNet model of [9], the winning ImageNet submission of 2012.

• PlacesNet [36]: a state-of-the-art model for scene image classification providing the highest accuracy
on the SUN397 dataset [37].

• HybridNet [36]: another model for both object and scene image classification, outperforming
state-of-the-art methods on the MIT Indoor67 dataset [38].

Details on the architecture, training set and layer sizes of the CNNs are summarized in Table II.

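TABLE II

Details on architecture, training set and layer size of the CNNs.

Column groups: Architecture (parameters, depth, input size), Training (training set, classes, data size), Layer Size (pool5, fc6, fc7, fc8).

| Model | Parameters | Depth (conv+fc) | Input size | Training set | Classes | Data size | pool5 | fc6 | fc7 | fc8 |
|---|---|---|---|---|---|---|---|---|---|---|
| OxfordNet | 138M | 13+3 | 224 x 224 x 3 | ImageNet | 1000 | 1.2M | 7 x 7 x 512 | 4096 | 4096 | 1000 |
| AlexNet | 60M | 5+3 | 227 x 227 x 3 | ImageNet | 1000 | 1.2M | 6 x 6 x 256 | 4096 | 4096 | 1000 |
| PlacesNet | 60M | 5+3 | 227 x 227 x 3 | Places-205 | 205 | 2.4M | 6 x 6 x 256 | 4096 | 4096 | 205 |
| HybridNet | 60M | 5+3 | 227 x 227 x 3 | Both | 1183 | 3.6M | 6 x 6 x 256 | 4096 | 4096 | 1183 |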

These state-of-the-art models are chosen as they allow us to run interesting control experiments, where
the CNN architecture or training data are varied. PlacesNet and HybridNet share the same architecture as
AlexNet [36], while being trained on different data. OxfordNet and AlexNet are trained on the same data,
but have different architectures: compared to AlexNet, OxfordNet is deeper, has twice as many layers,
twice the number of parameters, and achieved better image classification performance in the ImageNet
2014 contest [35].

The 4 models are trained differently, using the ImageNet [39] and Places-205 [36] datasets. With
categories like “Amphitheater”, “Jail cell” or “Roof garden”, Places-205 is a scene-centric dataset, while 
ImageNet, featuring categories such as “Vending machine”, “Barn spider” or “Chocolate syrup”, is more 
object-centric. Places-205 is twice as large as ImageNet, but has 5 times fewer classes. OxfordNet and 
AlexNet are trained on ImageNet. HybridNet is trained on a combination of ImageNet and Places-205 
data: the resulting dataset being 3 times larger than ImageNet alone, and having a larger variety of classes. 

Given an input image, we first resize it to a canonical resolution, compute the feed-forward neural
network activations, and extract the last four layers for each CNN model. We refer to the outputs of the
last 4 layers of each network as pool5, fc6, fc7 and fc8 (as denoted in Caffe). pool5 is the output of the last
convolutional layer after pooling, and fc6, fc7, fc8 are outputs of the fully connected layers. pool5 still
contains spatial information from the input image. The size of the last layer fc8 is equal to the number
of classes. All descriptors are extracted after applying the rectified linear transform, and L2 normalized:
the features are directly output from the Caffe implementation of the CNN models.
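As a rough illustration of the post-processing step, the sketch below applies the rectified linear transform and L2 normalization to raw layer activations, then ranks database descriptors by Euclidean distance to a query. The function names `postprocess` and `rank_database` are hypothetical, and the activations are assumed to have already been extracted from the network as NumPy arrays; the forward pass itself is not shown.

```python
import numpy as np

def postprocess(activations):
    """Apply ReLU, flatten each descriptor, and L2-normalize each row."""
    a = np.maximum(np.asarray(activations, dtype=np.float64), 0.0)
    # Flatten spatial maps such as pool5 (e.g. 7 x 7 x 512) into vectors
    a = a.reshape(a.shape[0], -1)
    norms = np.linalg.norm(a, axis=1, keepdims=True)
    # Guard against all-zero descriptors
    return a / np.maximum(norms, 1e-12)

def rank_database(query, database):
    """Return database indices sorted by increasing squared L2 distance."""
    d2 = np.sum((database - query[None, :]) ** 2, axis=1)
    return np.argsort(d2, kind="stable")
```

For unit-norm descriptors, ranking by Euclidean distance and ranking by cosine similarity give the same order, so either could be used in this sketch.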


V. Experimental Results 



Fig. 2. Different single-crop strategies used for input into CNN pipelines.


Fig. 3. MAP for different layers of OxfordNet, for different single-crop strategies on the Holidays data set. We observe that
Center crop and Squish perform comparably.


A. Best practices for CNN descriptors 

What single crop strategy is the best? 

CNN pipelines take input images at a fixed resolution (see Table II). We wish to determine which
single-crop strategy works best in the context of instance retrieval where images may vary in size and
aspect ratio. We consider the following 3 different cropping strategies illustrated in Figure 2. Numerical
values are given to fit OxfordNet.

• Center: the largest 224 x 224 center crop, after rescaling the image to 224 pixels for the smaller
dimension, while maintaining aspect ratio.

• Padding: the original image is resized to 224 pixels for the larger dimension, maintaining aspect
ratio, and any unfilled pixels are padded with a constant value equal to the training set mean.

• Squish: the original image is resized to 224 x 224. The original aspect ratio is ignored, potentially
resulting in distortions.
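The geometry of the three strategies can be made concrete with a small sketch that computes only the resize dimensions and crop or paste coordinates, without touching pixel data. The function names are hypothetical, and centering the resized image on the padded canvas is our assumption, since the text does not say where the image is placed.

```python
def center_plan(width, height, size=224):
    """Resize so the smaller side equals `size`, then crop the central square."""
    scale = size / min(width, height)
    new_w = max(size, round(width * scale))
    new_h = max(size, round(height * scale))
    left, top = (new_w - size) // 2, (new_h - size) // 2
    # Resized dimensions and crop box (left, top, right, bottom)
    return (new_w, new_h), (left, top, left + size, top + size)

def padding_plan(width, height, size=224):
    """Resize so the larger side equals `size`; the remainder is padded."""
    scale = size / max(width, height)
    new_w = min(size, round(width * scale))
    new_h = min(size, round(height * scale))
    # Paste offset on the size x size canvas (centering is an assumption)
    return (new_w, new_h), ((size - new_w) // 2, (size - new_h) // 2)

def squish_plan(width, height, size=224):
    """Resize directly to size x size, ignoring the aspect ratio."""
    return (size, size)
```

For a 640 x 480 (VGA) input, `center_plan` resizes to 299 x 224 and keeps columns 37 to 261, `padding_plan` resizes to 224 x 168 and leaves 28 rows of padding above and below, and `squish_plan` distorts the image to 224 x 224.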

In Figure 3, we plot MAP for different layers of OxfordNet, for the Holidays data set. We note that Center
and Squish perform comparably, outperforming Padding. The trend is consistent across the different
network layers. We observe similar results for other data sets and CNN models. Most data sets in this
study have a center bias for the object of interest, which explains the strong performance of the Center cropping
strategy.

A small improvement in performance for large-scale image classification is obtained by averaging
output class probabilities computed over several cropped regions within an image, often extracted at
different positions and scales [10]. Such a performance improvement can also be achieved for instance
retrieval by pooling CNN results over several cropped regions, but such a strategy could be applied to
other global descriptor pipelines too. For the remaining experiments in this paper, we consider a single
Center crop for processing all database and query images.

Which CNN layer performs the best? 

In Figure 4, we plot MAP for the last 4 layers of OxfordNet, AlexNet, PlacesNet and HybridNet
for different data sets. We note that for each network, intermediate layers perform best for instance
retrieval. Such a sweet spot is intuitive as the final layer represents higher level semantic concepts, while
intermediate convolutional and fully connected layers provide rich representations of low level image
information. We note that layer fc6 performs the best for all CNNs, for all data sets except Graphics. For
Graphics, performance drops with increase in depth, as all four CNN models are learnt on natural image
statistics, while the Graphics data set is biased towards data like CD covers, DVD covers, business cards,
and dense text in newspaper articles.

How much improvement can we obtain by deeper CNN architectures? 

We compare OxfordNet and AlexNet results in Figure 4. OxfordNet and AlexNet are both trained on
the same 1.2 million images from the ImageNet data set, but vary in the number of layers: 16 and 8
layers respectively. We note that OxfordNet outperforms AlexNet on all data sets, except Graphics. On
Graphics, the performance of OxfordNet is worse, strongly suggesting that performance improves with
more layers as long as the training data is representative of the test data set in consideration.

Fig. 4. MAP for the last 4 layers of state-of-the-art publicly available CNNs (one panel per data set, plotted against layer). OxfordNet and AlexNet are trained on the same
data, while PlacesNet, HybridNet and AlexNet have the same network architecture but are trained on different data. We note
that performance improves by using deeper networks, and by training on domain specific data, but only if training and testing
data have similar characteristics.

How much improvement can we obtain by training CNN models using domain specific data? 

For this experiment, we compare AlexNet with PlacesNet in Figure 4. AlexNet and PlacesNet use the
same 8-layer CNN architecture, but are trained on different data. We observe that PlacesNet outperforms
AlexNet on Holidays and Oxbuild data sets. This shows that using training data more representative
of the test data can improve performance significantly, as Holidays and Oxbuild are scene-centric. On
the object-centric UKBench and Graphics data sets, PlacesNet performs worse than AlexNet due to the
mismatch between training and test data.
Further, in Table III we compare our results to the CNN retrieval results presented in [15]. In [15],
Babenko et al. fine-tune a pre-trained AlexNet model based on ImageNet training data with domain
specific images, e.g., landmarks and objects. As shown in Table III, the authors are able to improve
retrieval performance over the AlexNet baseline model, on Holidays and UKBench by fine-tuning with
landmark and object data respectively. However, the resulting trade-off is a loss in performance on
Holidays when fine-tuning with object data and vice-versa.

We compare OxfordNet results with those of the fine-tuned models of [15]. We note that the deeper
architecture of OxfordNet trained on just ImageNet data results in comparable or higher performance 
than the fine-tuned models on both Holidays and UKBench, suggesting that there is more gain to be had 
with deeper networks rather than fine-tuning a shallower network with domain-specific data. 

Together with the previous sets of experiments, there is strong combined evidence that deeper
layers/models have the potential of achieving higher discriminativeness on domain specific data at the expense
of less generalisability on non-specific data.

How much improvement can we obtain by training CNN on larger and more diverse data? 

For this experiment, we compare AlexNet with HybridNet in Figure 4. AlexNet and HybridNet use
the same 8-layer CNN architecture, but the latter is trained on a combination of ImageNet and Places-205,
resulting in a larger training data set with more diverse classes. We note that HybridNet performs
comparably or better than AlexNet on all data sets except Graphics, suggesting that increasing the amount
and diversity of training data is as important as increasing depth in the CNN architecture.


TABLE III

State-of-the-art CNN and FV results for instance retrieval. 4 x Recall @ 4 for UKBench, and MAP for
other data sets.

| Descriptor | Dim | Holidays | UKBench | Oxbuild | Graphics |
|---|---|---|---|---|---|
| OxfordNet | 4096 | 0.80 | 3.54 | 0.46 | 0.33 |
| AlexNet | 4096 | 0.76 | 3.38 | 0.42 | 0.37 |
| HybridNet | 4096 | 0.81 | 3.39 | 0.48 | 0.36 |
| PlacesNet | 4096 | 0.80 | 3.11 | 0.46 | 0.33 |
| CNN (Fine-tuned on Landmarks) [15] | 4096 | 0.793 | 3.29 | 0.545 | - |
| CNN (Fine-tuned on Objects) [15] | 4096 | 0.754 | 3.56 | 0.393 | - |
| FVDoG | 32768 | 0.63 | 2.8 | 0.42 | 0.66 |
| FVDS | 32768 | 0.73 | 2.38 | 0.51 | 0.20 |
| FVDM | 32768 | 0.75 | 2.45 | 0.55 | 0.32 |








B. Best practices for FV interest points 

CNN features are obtained by dense sampling over the image. In Table III, we study if such an approach 
is also effective for FVs. We compare the performance of FVDoG, FVDS and FVDM as described in 
Section IV. 

We note that dense sampling (FVDS and FVDM) improves performance over FVDoG on Holidays 
and Oxbuild data sets, while hurting performance on Graphics and UKBench. Note the large drop in 
performance of dense sampling on the Graphics data set. This is intuitive as queries in the Graphics 
data set contain query objects at different scales and rotations. For Holidays and Oxbuild data sets, even 
FVDS improves performance over FVDoG, suggesting that most query and database image pairs occur 
at roughly the same scale. 

Dense sampling is effective for data sets like Holidays which consist primarily of outdoor scenes, and 
are mainly composed of highly textured patches. The improvement in performance of dense sampling 
approaches can also be attributed to the discriminativeness-invariance tradeoff. Where retrieval does not 
require scale and rotation invariance, and data are highly textured over the entire image, performance can 
be improved by dense sampling. 

Sampling at multiple scales also seems to consistently improve results over single scale sampling for 
dense descriptors. 

C. Comparisons to state-of-the-art 

Does combining FV and CNN improve performance? 

In Figure 5, we present retrieval results obtained by combining FVDoG, FVDS, and FVDM individually 
with OxfordNet fc6 features. We employ a simple early fusion approach where the FV and CNN features 
are concatenated after weighting by (1 − α) and α respectively. α = 0 corresponds to using just FVDoG, 
FVDS or FVDM features individually, while α = 1 corresponds to just the OxfordNet feature. This early 
fusion scheme is also equivalent to weighting the squared L2 distance measure for matching by (1 − α) 
and α for FV and CNN features respectively. 
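
The early fusion step can be illustrated with a minimal sketch. It assumes that both descriptors are already normalised to comparable scales, and it scales each block by the square root of its weight so that the squared L2 distance between fused vectors is exactly the weighted sum of the per-descriptor squared distances. If the weights multiply the features directly, the squared distance is instead weighted by the squares of the weights; which convention was used in the experiments is not stated, so the square-root choice here is an assumption. The function names are illustrative only.

```python
import math


def early_fusion(fv, cnn, alpha):
    """Concatenate FV and CNN descriptors with weights (1 - alpha) and alpha.

    Assumption: blocks are scaled by the square roots of the weights, so the
    squared L2 distance between two fused vectors equals
    (1 - alpha) * d_fv^2 + alpha * d_cnn^2.
    """
    w_fv = math.sqrt(1.0 - alpha)
    w_cnn = math.sqrt(alpha)
    return [w_fv * x for x in fv] + [w_cnn * x for x in cnn]


def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy descriptors (made-up values, not taken from the experiments).
fv_q, fv_d = [0.6, 0.8], [0.8, 0.6]
cnn_q, cnn_d = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
alpha = 0.4

fused = sq_dist(early_fusion(fv_q, cnn_q, alpha), early_fusion(fv_d, cnn_d, alpha))
separate = (1 - alpha) * sq_dist(fv_q, fv_d) + alpha * sq_dist(cnn_q, cnn_d)
print(fused, separate)  # the two values agree up to floating-point rounding
```

With alpha = 0 the CNN block is zeroed out and only the FV contributes to the distance; with alpha = 1 only the CNN feature does.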

All four data sets show an improvement in peak performance by combining FV and CNN features. The 
maximum performance is achieved for α = 0.4 for the Holidays, UKBench and Oxbuild data sets, and 
α = 0.3 for the Graphics data set, using different FVs. There is a significant improvement in performance 
by combining FV and CNN features on all data sets except Graphics. The results suggest that a simple 
hyperparameter can be used to combine FV and CNN across data sets with similar characteristics. Also, 
α = 0.4 suggests that FVs contribute significantly to achieving high retrieval performance (in contrast, 
an α parameter close to 1 would suggest that most of the contribution is from the CNN feature). 

Note that our goal here is to show that performance can be improved significantly by combining FV and 
CNN features, and not necessarily to achieve the highest performance on these retrieval benchmarks. Peak 
performance presented in Figure 5 can be improved by (a) database-side rotation and scale pooling, which 
helps significantly (see Sections V-D and V-E), (b) better CNN models than OxfordNet on individual data 
sets (see Table III), (c) better FV based on Hessian Affine interest points l[2~il instead of DoG points used 
in this paper 021, (d) better FV with more sophisticated aggregation techniques [20], (e) combining 
all FV and CNN descriptors together, and (f) using more sophisticated fusion and ranking techniques for 
combining results, like the one proposed in the recent paper [40]. 


TABLE IV 

State-of-the-art results. MAP for Oxbuild and Holidays, and 4 × Recall@4 for UKBench. A dash marks an 
entry with no reported value. 

Descriptor                                        Dim           Holidays  UKBench  Oxbuild
------------------------------------------------  ------------  --------  -------  -------
Bag-of-words 1M [24]                              1M            -         3.19     -
VLAD baseline PTI                                 8192          0.526     3.17     -
Fisher baseline [41]                              8192          0.495     3.09     -
Fisher baseline (ours)                            32768         0.63      2.8      0.42
Fisher ADC (320 bytes) El                         2048          0.634     3.47     -
Fisher+color [42]                                 4096          0.774     3.19     -
VLAD++ [43]                                       32768         0.646     -        0.555
Sparse-coded features [44]                        11024         0.767     3.76     -
Triangulation Embed [20]                          8064 / 1920   0.77      3.53     0.676
Best CNN results from this paper across all CNN   4096          0.81      3.54     0.48
Fusion of OxfordNet and Baseline FV               32768 + 4096  0.85      3.71     0.59

Comparisons to state-of-the-art. 

In Table IV, we compare state-of-the-art results reported on Holidays, UKBench and Oxbuild. We 
include a wide range of approaches, from Bag-of-words [24] to the latest FV aggregation methods [20]. 
We include the best CNN and fusion results reported in this paper. We note that the best CNN results 
(based on pre-trained models considered in this work) achieve higher performance than state-of-the-art 
FV approaches [20] on the Holidays and UKBench data sets. There is a gap in performance between the CNN 
results reported in this work and state-of-the-art FV for Oxbuild; however, Oxbuild is a much smaller 
data set with only 55 queries. Finally, we note that the simple fusion technique in Figure 5 results in 
the highest or one of the highest performance numbers reported on each data set. Peak performance numbers 
for the fusion approach can be improved using approaches (a)-(f) described above. 

[Figure 5 plots: panels include UKBench and Graphics; x-axis: Alpha; legend entries include FVDS and FVDM.] 

Fig. 5. Combining different FVs with OxfordNet fc6 with early fusion. FV and OxfordNet features are 
concatenated with weights (1 − α) and α respectively. α = 0 refers to just using FVDoG, FVDS or FVDM 
above, while α = 1 refers to just the OxfordNet fc6 feature. We observe that retrieval performance 
improves on all data sets by combining FV and CNN. 

D. Invariance to Rotation 

How invariant are CNN features to rotation? 

CNN features, unlike FVDoG, have limited levels of rotation invariance. The invariance arises from 
the max-pooling steps in the CNN pipeline, and rotated versions of objects present in the training data. 
In Figure 7, we rotate each query at different angles and measure MAP for the Holidays data set for 
different layers of OxfordNet. For these control experiments, query images are cropped circularly in 
the center (to avoid edge artifacts) and rotated in steps of 10°. The same experimental set up is also 
employed in the evaluation of rotation invariant features in [45]. We note that CNN features have very 
limited rotation invariance, with performance dropping steeply beyond 10°. Furthermore, the different 
layers of the network exhibit similar characteristics, suggesting that rotation invariance does not 
increase with depth in the CNN. 

Fig. 6. We extract features from database images at different rotations and pool them to obtain a single representation. We 
rotate queries and evaluate retrieval performance for different pooling parameters and strategies. All database and query images 
are cropped circularly at the center to avoid edge artifacts for this experiment, and padded with a default mean RGB value 
(ImageNet mean). 

[Figure 7 plot: Holidays.] 

Fig. 7. MAP as query images are rotated for different layers of OxfordNet. We note that CNN features have very limited rotation 
invariance, with performance dropping steeply for all layers of the network beyond 10°. 

Are FVs or CNNs more rotation invariant? 

[Figure 8 plots: x-axis: Query Rotation Angle, with ticks from -100 to 100.] 


Fig. 8. Comparison of OxfordNet fc6 and FV for rotated queries on the Holidays and Graphics data sets. The FVDoG is robust 
to rotation, while OxfordNet, FVDS, FVDM suffer a sharp drop in performance. 


For the sake of evaluation, we choose one scene-centric and one object-centric data set that are most 
different: Holidays and Graphics. In Figure 8, we compare the performance of FV variants and OxfordNet 
fc6 as queries are rotated at different angles. Note that FVDoG is robust to rotation: the minor modulation 
in performance is due to filtering artifacts in the DoG interest point detector. However, the OxfordNet 
features, FVDS and FVDM have a steep drop in performance as queries are rotated. The OxfordNet 
features are more rotation invariant than FVDS and FVDM. Finally, note the large gap in performance 
between FVDoG and other schemes for the Graphics data set. The gap in performance on Graphics 
arises from two contributing factors: (a) the worse performance of the OxfordNet features on this data 
set, and (b) the fact that there are several rotated queries on which the OxfordNet features perform worse. 
The effect of each can be isolated from the next set of experiments we conduct. Next, we discuss how 
to gain invariance for the CNN, FVDS and FVDM pipelines. Ideally, we desire invariance to rotation, 
while maintaining high discriminability. 

How do we gain rotation invariance for CNN features? 

We propose a database pooling scheme, illustrated in Figure 6, for gaining rotation invariance. Each 
database image is rotated within a range of −p° to p°, in steps of 10°. The CNN features for each rotated 
database image are pooled together into one common global descriptor representation. In Figure 9, we 
present results for max-pooling, where we store the component-wise maximum value across all rotated 
representations in an angular range. P = 0 refers to no pooling, while P = p refers to pooling in the 
range of −p° to p° in steps of s = 10°. The parameter s indicates the quantization step size of angular 
rotation of database images. 

Fig. 9. MAP vs query rotation angle for different pooling parameters. Results are presented on OxfordNet fc6 layer on the 
Holidays data set. P = 0 refers to no pooling. P = p refers to max-pooling over individual feature dimensions, for rotations 
between −p° and p° in steps of s = 10°. Invariance to rotation increases with increasing p, at the expense of lower performance 
at angle 0°. 
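
The database-side pooling described above can be sketched as follows. This is only an illustration: `extract` and `rotate` are hypothetical callables standing in for the CNN feature extractor and the image rotation routine, neither of which is specified here, and the sketch assumes that p is a multiple of the step size s so that the upright image (angle 0) is included.

```python
def rotation_angles(p, s=10):
    """Angles from -p to +p degrees in steps of s; p = 0 gives just [0]."""
    return list(range(-p, p + 1, s))


def pool_descriptors(descriptors, mode="max"):
    """Component-wise pooling of equal-length descriptors into one vector."""
    if mode == "max":
        return [max(column) for column in zip(*descriptors)]
    if mode == "avg":
        return [sum(column) / len(column) for column in zip(*descriptors)]
    raise ValueError(f"unknown pooling mode: {mode}")


def pooled_database_descriptor(image, extract, rotate, p, s=10, mode="max"):
    """One pooled descriptor for a database image.

    `extract(image)` and `rotate(image, angle)` are hypothetical helpers
    supplied by the caller; they are not part of any particular library.
    """
    rotated = [extract(rotate(image, angle)) for angle in rotation_angles(p, s)]
    return pool_descriptors(rotated, mode)


# Toy check with three made-up 3-dimensional descriptors.
toy = [[1, 5, 0], [3, 2, 0], [2, 2, 6]]
print(pool_descriptors(toy, "max"))
print(pool_descriptors(toy, "avg"))
```

Only one pooled vector is stored per database image, so the storage cost is the same as without pooling; the query side is left unchanged.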

We plot performance as query rotation angle is varied for varying pooling parameter P, on OxfordNet 
fc6 layer for the Holidays and Graphics data sets. The invariance-discriminativeness trade-off is shown 
in Figure 9. We observe that the max pooling scheme performs surprisingly well for gaining rotation 
invariance. As P is increased, the performance curve flattens in the range of −P° to P°, at the expense of 
lower performance for upright queries, i.e., at angle 0°. For the Holidays data set, most database and query 
images share similar “upright” orientations. For the Graphics data set, note the gap in performance 
at angle 0° between no pooling and the different pooling schemes. This gap in performance 
can be attributed to rotated objects in the query data set. The remaining gap at the original angle 0° between 
FVDoG in Figure 8 and CNN features in Figure 9 can be attributed to the worse performance of the 
OxfordNet features for this data set. 

To further understand the effectiveness of database-side pooling, we evaluate different types of pooling 
methods and database augmentation techniques in Figure 10. We show results for component-wise max 
pooling and average pooling over rotated database images for different pooling parameters P. We note 
that max pooling and average pooling perform comparably for small P. For P = 180°, we note that 
average pooling outperforms max pooling. 


We compare the two pooling strategies to a simple database augmentation technique labeled Min-dist, 
which stores descriptors for each rotated version of the database image. For Min-dist, at query time, 
we compute the minimum distance to all the rotated versions for each database image. The Min-dist 
scheme increases the size of the database by a factor of 2P/s + 1, where s = 10 is the step size in 
degrees, and P is the pooling parameter. For small s → 0, the Min-dist scheme provides an approximate 
upper bound on the performance that the max and average pooling schemes can achieve, as a descriptor 
for each rotated version is explicitly stored in the database. We observe that both max and average 
pooling are surprisingly effective, as their performance comes close to that of the Min-dist scheme 
while storing only one descriptor per database image. Note that the Min-dist scheme performs the best, 
as there is no drop in performance at 0°, compared to the pooling methods. 

Fig. 10. Comparison of different types of database pooling and augmentation techniques, for varying pooling parameter P. 
OxfordNet layer fc6 is used on the Holidays data set. We notice that max and average pooling come close to the performance of 
Min-dist and Min-dist (PWL), which require storage of multiple feature descriptors (2P/s + 1) for each database image, where 
s = 10 is the quantization step size in degrees. 

[Figure 11 plot: Holidays.] 

Fig. 11. Results of the Min-dist (PWL) scheme as step size parameter s is varied. OxfordNet layer fc6 is used on the Holidays 
data set. A piece-wise linear approximation of the manifold, on which rotated descriptors of each image lie, is used to trade off 
performance and matching complexity. The performance of step size s = 60° is close to that of s = 10°, while reducing 
memory requirements by 6x. 
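
As a rough illustration of the Min-dist scheme, the sketch below keeps one descriptor per rotation for every database image and scores an image by its closest rotated descriptor. The dictionary layout of the database and all function names are assumptions made for this example. The helper `descriptors_per_image` simply counts the rotations from −P to P in steps of s with both endpoints included, which is 2P/s + 1.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_dist(query, rotated_descriptors):
    """Distance from the query to the closest stored rotated descriptor."""
    return min(sq_dist(query, d) for d in rotated_descriptors)


def descriptors_per_image(p, s=10):
    """Number of stored descriptors for rotations -p..p in steps of s."""
    return 2 * p // s + 1


def rank_database(query, database):
    """Rank image ids by Min-dist.

    `database` maps an image id to its list of rotated descriptors
    (a hypothetical in-memory layout chosen for this sketch).
    """
    return sorted(database, key=lambda image_id: min_dist(query, database[image_id]))


# Toy database with made-up 2-dimensional descriptors.
database = {"a": [[0.0, 0.0], [1.0, 1.0]], "b": [[5.0, 5.0]]}
print(rank_database([1.0, 1.2], database))
print(descriptors_per_image(180, 10))
```

Matching cost and storage both grow linearly with the number of stored rotations, which is the price paid for avoiding the drop in performance at 0° seen with pooling.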

Next, we also propose a scheme, illustrated in Figure 12, for reducing memory requirements of the 
Min-dist scheme at the expense of increased matching complexity. The scheme labeled Min-dist (PWL) 
assumes a piece-wise linear approximation of the manifold on which the descriptors of each rotated image 
lie, and computes the closest distance to the manifold. The results for Min-dist (PWL) with step size 
10° are shown in Figure 10, and it performs comparably to the Min-dist scheme. Instead of maintaining 
database descriptors at finely quantized angular rotations of s = 10°, we increase s to 30°, 60°, 90° and 
present results in Figure 11. We note that the performance of step size s = 60° is close to that of s = 10° 
for Min-dist (PWL), while reducing memory requirements by 6x. The drop in performance for the 
Min-dist (PWL) scheme at −135°, −45°, 45°, 135° for s = 90° shows inherent data set bias at these angles. In 
conclusion, the proposed simple but elegant Min-dist (PWL) scheme helps gain rotation invariance, while 
requiring storage of fewer descriptors compared to the Min-dist approach. 
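
One natural reading of the piece-wise linear approximation is to connect the descriptors of consecutive rotations with straight segments and measure the distance from the query to the nearest point on any segment. The sketch below follows that reading; it is an assumption about the details, not a statement of the exact procedure used in the experiments, and the function names are illustrative.

```python
def sq_dist_to_segment(q, a, b):
    """Squared L2 distance from point q to the segment between a and b."""
    ab = [y - x for x, y in zip(a, b)]
    aq = [y - x for x, y in zip(a, q)]
    denom = sum(v * v for v in ab)
    if denom == 0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Project q onto the line through a and b, then clamp to the segment.
        t = max(0.0, min(1.0, sum(u * v for u, v in zip(aq, ab)) / denom))
    closest = [x + t * v for x, v in zip(a, ab)]
    return sum((x - y) ** 2 for x, y in zip(q, closest))


def min_dist_pwl(query, rotated_descriptors):
    """Distance from the query to the polyline through consecutive descriptors.

    `rotated_descriptors` is assumed to be ordered by rotation angle.
    """
    if len(rotated_descriptors) == 1:
        only = rotated_descriptors[0]
        return sq_dist_to_segment(query, only, only)
    pairs = zip(rotated_descriptors, rotated_descriptors[1:])
    return min(sq_dist_to_segment(query, a, b) for a, b in pairs)


# Toy example: the query lies between stored descriptors, so the polyline
# distance is smaller than the distance to any stored descriptor.
stored = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]]
print(min_dist_pwl([1.0, 1.0], stored))
```

Each segment needs a projection in addition to a distance computation, which is where the extra matching complexity comes from; in exchange, a coarser step size can be stored.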

The surprising effectiveness of max and average pooling to gain rotation invariance for CNN features 
led us to run the same set of experiments on FVDM. We present the results for max pooling, average 
pooling, Min-dist, and Min-dist (PWL) with step size s = 10° in Figure 13 for P = 180°. Average 
pooling on FVDM helps gain invariance to rotation while lowering peak performance achieved without 
pooling (P = 0). However, note the large difference in performance between max and average pooling for 
FVDM. CNN features are sparse with a small number of dimensions with high values: spikes resulting 
from the activation of neurons in the network. FVDM data are comparatively more dense. As a result, 
max pooling on OxfordNet features is far more effective than for FVDM. Finally, in Figure 13, Min-dist 
and Min-dist (PWL) perform the best, as also observed for OxfordNet features. 

Fig. 12. Illustration of the different pooling schemes. The scheme labeled Min-dist (PWL) assumes a piece-wise linear 
approximation of the manifold on which the descriptors of each rotated image lie, and computes the closest distance to the 
manifold. 


[Figure 13 plot: Holidays.] 

Fig. 13. Comparison of different types of database pooling and augmentation techniques for pooling parameter P = 180° for 
FVDM on the Holidays data set. Note the difference in performance of max and average pooling for FVDM, compared to max 
and average pooling on OxfordNet features in Figure 10(d). 

E. Invariance to Scale Changes 

In this section, we study scale invariance properties of CNNs and FVs. Similar to the rotation experiments 
in Section V-D, we carry out control experiments on the Holidays data where we reduce the scale of query 
images and measure retrieval performance. The starting resolution of all images (database and queries) 
is set to the larger dimension of 640 pixels (maintaining aspect ratio), as discussed in Section IV. 

Both CNN and FV pipelines take in input images at fixed resolution. For the FVDoG, FVDS and FVDM 
pipelines, we resize images to VGA resolution (preserving aspect ratio) before feature extraction, even 
when the input resolution is smaller. Upsampling images before feature extraction is shown to improve 
matching performance [8]. The size of center crop input images (after resizing) to the OxfordNet pipeline 
is specified in Table II. 

How invariant are CNN features to scale? 

We scale query images along both image dimensions by a ratio of 0.75, 0.5, 0.375, 0.25, 0.2 and 
0.125 starting from the VGA resolution: the smallest queries are 1/8th the size of the VGA resolution 
image. An anti-aliasing Gaussian filter is applied, followed by bicubic interpolation in the downsampling 
operation. We present MAP for different layers of OxfordNet in Figure 14. We note that the fc6 layer 
experiences only a small drop in performance up to scale 0.25 before steeply dropping off. The OxfordNet 
features are learnt on input images of fairly low resolution, which explains the robustness to large changes 
in scale. We also note that the three fully connected layers exhibit similar characteristics for scale change: 
deeper fully connected layers are not more scale invariant. It is interesting to note that pool5 is more 
scale invariant, as seen from the more gradual drop in performance as query scale is decreased; however, 
pool5 is less discriminative, with a significant performance gap for smaller scale changes (0.75 to 0.25). 

Are CNN or FV more scale invariant? 

Similar to the rotation experiment, we compare performance of OxfordNet fc6 and FVs in Figure 15 
for the Holidays and Graphics data sets. We observe a steeper drop in performance with decreasing 
scale for FVDM and FVDS compared to OxfordNet. Somewhat surprisingly, FVDoG also experiences 
a sharper drop in performance compared to CNN. The trends are consistent across data sets: the only 
difference is that the peak performance of FVDoG is higher than CNN on Graphics. Trends similar to 
Holidays are observed on the remaining two data sets. The sharp drop in performance of FVDoG can be 
attributed to the failure of the interest point detector at small scales. CNNs are learnt on smaller images 
to begin with, and objects shown at different scales at training time are sufficient for achieving more 
scale invariance than FVDoG. In comparison to the rotation experiments, it is interesting to note that 
FVDoG are more robust to rotation changes, while OxfordNet features are more robust to scale changes. 

[Figure 14 plot: Holidays.] 

Fig. 14. MAP as query images are scaled to 0.125 of original resolution, for different layers of OxfordNet on the Holidays data 
set. We note that OxfordNet features are robust to scale change up to 0.25, with performance dropping steeply after. 


[Figure 15 plots: panels Holidays and Graphics; x-axis: Query Scale.] 



Fig. 15. Comparison of OxfordNet fc6 and FVs for scaled queries on the Holidays and Graphics data sets. We observe that 
OxfordNet features are more robust to scale changes compared to FVDoG, FVDS and FVDM, all of which experience a steeper 
drop in performance as query scale is decreased. 


Does database-side pooling improve scale invariance for CNN features? 

Next, we discuss how performance at small scales can be improved by pooling descriptors on the 
database side. As illustrated in Figure 16, the component-wise pooling operation across scales is similar 
to the database-pooling performed on rotated images. The parameter SP refers to the number of scales 
over which OxfordNet features are pooled. SP = n refers to pooling across the first n + 1 scales of the 
set of six scale-ratios (seven including one) (1, 0.75, 0.5, 0.375, 0.25, 0.2, 0.125). SP = 1, hence, refers 
to no database pooling. 

Fig. 16. We extract features from database images at different scales and pool them to obtain a single representation. We scale 
queries to different sizes, and evaluate retrieval performance for different pooling parameters and strategies. 

In Figure 17(a), we first study MAP vs query scale for different types of pooling on the Holidays data set 
for SP = 6 (pooling over all scales in consideration). OxfordNet fc6 features are used in this experiment. 
We note that max-pooling outperforms average pooling by a small margin, and comes close to the 
performance of the Min-dist scheme, which stores the descriptors of all the scaled versions of the database 
image and computes the minimum distance. Similar to the rotation experiment, the Min-dist (PWL) 
scheme, which computes the minimum distance to a piece-wise linear manifold of the CNN descriptors 
for the six scaled images, is also effective for the scale experiment. Min-dist (PWL) outperforms Min-dist 
by a small margin, as it is more robust to matching query data which lie at intermediate quantized scales. 
For SP = 6, there is a significant improvement in performance at small scales for the pooling schemes, 
with only a marginal drop in performance for points close to the original scale (seen from the right-most 
points on the curve in Figure 17(a)). 

In Figure 17(b), we study varying pooling parameter SP for max-pooling. Performance at small 
scales increases as SP is increased, with only a marginal drop at query scale 0.75. A significant gain 
in performance of 10% is achieved for the smallest query scale 0.25, showing the effectiveness of the 
max-pooling approach. 

Fig. 17. Performance of different database-side pooling schemes, as query scale is changed. Results reported on OxfordNet 
fc6 on the Holidays data set. SP = 1 refers to no database-pooling. SP = n refers to pooling across the first n + 1 scales 
of the set (1, 0.75, 0.5, 0.375, 0.25, 0.2, 0.125). In Figure (a), we notice that max pooling comes close to the performance of 
Min-dist which requires storing descriptors at all scales. In Figure (b), we observe that performance improves at small scales 
with database side pooling as parameter SP is increased. 


VI. Open Questions 

The systematic study in this paper opens up several interesting avenues for future work. We highlight 
the most important open questions here. 

• Pre-trained CNN models trained for large-scale image classification tasks, with larger amounts of 
data have the potential of improving performance further for instance retrieval. For instance, CNN 
models trained on the full ImageNet data set with 14 million images and 10000 classes could lead 
to more discriminative features for the instance retrieval task. 

• While supervised CNN models have far outperformed their unsupervised CNN counterparts for 
large-scale image classification, the latter approach deserves careful attention in the context of instance 
retrieval. For the instance retrieval task, we desire rich representations of low level image information, 
which can be learnt directly from the large amounts of unlabelled image data available on the internet. 
As image classification is not the end goal, unsupervised CNN models trained with large amounts of 
data might achieve comparable or better performance for instance retrieval tasks. Availability of large 
amounts of training data (e.g., the Yahoo 100 million image data set [46]) and recent advances in 
open-source software for large-scale distributed deep learning (e.g., Torch [47]) will enable training of 
large-scale unsupervised CNN models. If unsupervised CNN models work well for instance retrieval, 
they will enable easier training and adaptation to different types of image databases. 

• Rotation and scale invariance are key to instance retrieval tasks. While the database pooling schemes 
proposed in this work are highly effective, they are more of an after-thought to solving the invariance 
problem in the CNN context. Learning CNN representations which are inherently scale and rotation 
invariant is an exciting direction to pursue. 

• Interest point detectors provide an efficient and effective way of achieving desired levels of 
invariance (ranging from scale and rotation invariance to affine invariance). The carefully hand-crafted 
SIFT descriptor has been remarkably effective for the instance retrieval task; however, patch level 
descriptors can now be learnt with large amounts of data, using data sets like the Winder and Brown 
patch data sets [48], and the Stanford Mobile Visual Search patch data set [49]. A hybrid approach 
of interest point detectors with learnt CNN descriptor representations could lead to a significant 
improvement in retrieval performance. 

• Hybrid interest point detection schemes like the dense interest point detector proposed originally 
in [50] need to be revisited, in light of the effectiveness of CNN features which are extracted by 
dense sampling in the image. A recent survey of dense interest point detectors [51] is a good starting 
point. 

• Finally, our study has demonstrated that, unlike for large-scale image classification, combining CNNs 
with “less effective” types of descriptors such as FVs is a valid way to improve retrieval performance. 
We point readers to the recent work in this field from Gong et al. [52] and Xu et al. [53], who have 
proposed effective FV/VLAD style encoding schemes for CNN descriptors. 

VII. Conclusions 

In this work, we proposed a systematic and in-depth evaluation of FV and CNN pipelines for image 
retrieval. Our study has led to a comprehensive set of practical guidelines that we believe can be useful to 
anyone seeking to implement state-of-the-art descriptors for image retrieval. Some of the recommendations 
are general good practices while others are more problem specific. 

We also showed that, unlike in image classification, the supremacy of CNNs over FVs does not always 
hold in the case of image retrieval, and strategies mixing both approaches are most likely optimal. 
In particular, the lack of transformation invariance of the descriptors appears to be one of the main 
drawbacks of CNNs. We proposed a number of simple and effective approaches which can 
be followed to patch these deficiencies. Nevertheless, we believe that better integrating invariance is key 
to the improvement of performance. 